Abstract Title Page 

Not included in page count. 


Title: Mapping U.S. school district test score distributions onto a common scale, 2008-2013 

Authors and Affiliations: Sean Reardon (Stanford University), Demetra Kalogrides (Stanford 
University), Andrew Ho (Harvard Graduate School of Education) 


SREE Spring 2016 Conference Abstract Template 



Abstract Body 

Limit 4 pages single-spaced. 


Background / Context: 

Description of prior research and its intellectual context. 

U.S. school districts differ dramatically in their socioeconomic and demographic characteristics 
(Reardon, Yun, & Eitle, 1999; Stroub & Richards, 2013), and districts have considerable 
influence over instructional and organizational practices that may affect academic achievement 
(Whitehurst, Chingos, & Gallaher, 2013). Nonetheless, we have relatively little rigorous 
large-scale research describing national patterns of variation in achievement across districts, 
let alone an understanding of the factors that cause this variation. Such analyses generally 
require district-level test score distributions that are comparable across states. Although these 
comparisons are possible within states and for selected districts across states,^ a national 
district-level database of comparable achievement scores does not exist. 

This paper evaluates a linking method applied to a unique national district-level dataset that may 
achieve the goal of national district-level comparability for research purposes. We begin with a 
dataset constructed from an application of a method presented at this conference last year (Shear, 
Castellano, Reardon, & Ho, 2014). The authors fit a heteroskedastic probit model to categorical 
test score data from the EDFacts Initiative (U.S. Department of Education, 2015), which includes 
frequencies of students in coarse “achievement levels” from every school district in the U.S. 
from 2008 to 2012. Data from 2013 are forthcoming. The authors demonstrate that a 
reparameterization of conventional model parameters recovers, with remarkable accuracy, the 
district test score means and standard deviations that would be computed from fine-grained data. 
The method is useful because the latter data are seldom available in practice. However, district 
means and standard deviations remain incomparable across states, because state test score 
categories and their underlying test score scales differ. 
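
To illustrate why coarsened counts can still carry information about a distribution's mean and standard deviation, the following Python sketch recovers both from two cumulative proportions. It is a deliberately simplified stand-in, not the heteroskedastic probit model of Shear et al.: it assumes a single district with normally distributed scores and two cut scores already known on the reporting scale. The function name and all numeric values are hypothetical.

```python
from statistics import NormalDist


def mean_sd_from_two_cuts(p_below_c1, p_below_c2, c1, c2):
    """Recover a normal mean and SD from two cumulative proportions.

    p_below_c1, p_below_c2: shares of students scoring below cut scores c1 < c2.
    Assumes scores are normally distributed (an illustrative assumption).
    """
    std_normal = NormalDist()
    z1 = std_normal.inv_cdf(p_below_c1)  # equals (c1 - mu) / sigma
    z2 = std_normal.inv_cdf(p_below_c2)  # equals (c2 - mu) / sigma
    sigma = (c2 - c1) / (z2 - z1)
    mu = c1 - sigma * z1
    return mu, sigma


# Hypothetical district: simulate the coarse proportions from known parameters,
# then check that the mean and SD can be recovered from the proportions alone.
true_dist = NormalDist(mu=0.3, sigma=0.9)
p1 = true_dist.cdf(-0.5)  # share below the first cut score
p2 = true_dist.cdf(0.6)   # share below the second cut score
mu_hat, sigma_hat = mean_sd_from_two_cuts(p1, p2, -0.5, 0.6)
print(round(mu_hat, 4), round(sigma_hat, 4))
```

With more than two cut scores or many districts sharing unknown cut scores, the problem is no longer exactly identified in this way, which is where a likelihood-based model such as the one cited becomes necessary.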

We employ a linear linking method under the assumption that state test score distributions have 
the same mean and standard deviation as their counterparts in the National Assessment of 
Educational Progress (NAEP) for the same subjects, grades, and years. The baseline linking 
method is reviewed by Kolen and Brennan (2014). Hanushek and Woessmann (2012) have 
employed similar methods for international comparisons. Using NAEP as a basis for linking tests 
has been deemed infeasible for high-stakes student-level reporting (Feuer, Holland, Green, 
Bertenthal, & Hemphill, 1999); however, our goal is to support aggregate-level policy analysis. 
For these purposes, as we demonstrate in this paper, we treat the issue empirically, using a 
variety of validation checks. 

Purpose / Objective / Research Question / Focus of Study: 

Description of the focus of the research. 


^ Limited cross-state district comparisons are possible with the Trial Urban District Assessment (TUDA), which 
reports scores for 20 large districts biennially on the common scale of the National Assessment of Educational 
Progress (NAEP). They are also possible with tests that have cross-state district-level adoption, like the Measures of 
Academic Progress (MAP) test from Northwest Evaluation Association (NWEA); however, participation is nowhere 
near national. The Common Core State Standards Assessment Consortia may eventually provide more but still not 
national comparisons. 

Our goal is to map district test score means and standard deviations onto a common metric 
across states and evaluate the accuracy of the mapping. 

Setting: 

Description of the research location. 

The research setting is public elementary and middle school districts in the United States from 
2008 to 2013. 

Population / Participants / Subjects: 

Description of the participants in the study: who, how many, key features, or characteristics. 

The population consists of all U.S. public school students in grades 3-8 from 2008 to 2013. 

Intervention / Program / Practice: 

Description of the intervention, program, or practice, including details of administration and duration. 

State testing programs differ but are all mandated under the Elementary and Secondary 
Education Act to report individual student scores in ordinal proficiency categories in 
Mathematics and Reading or English Language Arts annually in grades 3-8. States provide these 
scores by school, grade, and district to the EDFacts database. States are also mandated to have a 
sample of their schools participate in the National Assessment of Educational Progress (NAEP) 
every other year, in Reading and Mathematics, grades 4 and 8. We will have state test score data 
from 2008 to 2013 and NAEP data from overlapping years, 2009, 2011, and 2013. 

Research Design: 

Description of the research design. 

Please see below. 

Data Collection and Analysis: 

Description of the methods for collecting and analyzing data. 

Analysis of EDFacts data using methods from Shear, Castellano, Reardon, and Ho (2014) yields 
estimates of district test score means and standard deviations on a common state scale with a 
state mean of 0 and a state variance of 1. We refer to these estimated district means and standard 
deviations as $\hat{\mu}^{state}_{dygb}$ and $\hat{\sigma}^{state}_{dygb}$, respectively, for 
district d, year y, grade g, and subject b. These methods also provide estimated standard errors 
of these estimates, $\widehat{se}(\hat{\mu}^{state}_{dygb})$ and 
$\widehat{se}(\hat{\sigma}^{state}_{dygb})$, respectively. We also gather reported reliability 
statistics from state test score technical manuals, $\hat{\rho}^{state}_{sygb}$. Through both the 
publicly accessible NAEP Data Explorer and the restricted-use data, we have estimates of NAEP 
means and standard deviations at the state (s) level, $\hat{\mu}^{naep}_{sygb}$ and 
$\hat{\sigma}^{naep}_{sygb}$, respectively, as well as their standard errors. 

Under the assumption that NAEP and state test score means and variances should be the 
same, we first map district-subgroup means to the NAEP scale linearly, for overlapping 
years and grades. Because district test score moments are already expressed on a state 
scale with mean 0 and unit variance, the expressions are simple: 

$$\hat{\mu}^{naep}_{dygb} = \hat{\mu}^{naep}_{sygb} + \frac{\hat{\mu}^{state}_{dygb}}{\sqrt{\hat{\rho}^{state}_{sygb}}} \cdot \hat{\sigma}^{naep}_{sygb}$$

$$\hat{\sigma}^{naep}_{dygb} = \hat{\sigma}^{state}_{dygb} \cdot \hat{\sigma}^{naep}_{sygb}$$

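
The linear link can be sketched in a few lines of Python. This is an illustration under the stated assumption that state and NAEP score distributions share a mean and standard deviation within a state, subject, grade, and year: the district mean is divided by the square root of the state test reliability before rescaling, and the district standard deviation is rescaled without a reliability adjustment. The function name, argument names, and numeric inputs are hypothetical.

```python
import math


def link_to_naep(mu_state_d, sigma_state_d, mu_naep_s, sigma_naep_s, rho_state):
    """Map a district mean and SD from the standardized state scale to NAEP.

    mu_state_d, sigma_state_d: district mean and SD in state SD units.
    mu_naep_s, sigma_naep_s: the state's NAEP mean and SD.
    rho_state: reliability of the state test, between 0 and 1.
    """
    # Disattenuate the district mean, then rescale and shift to the NAEP metric.
    mu_linked = mu_naep_s + (mu_state_d / math.sqrt(rho_state)) * sigma_naep_s
    # The district SD is a ratio to the state SD, so it is only rescaled.
    sigma_linked = sigma_state_d * sigma_naep_s
    return mu_linked, sigma_linked


# Hypothetical district a quarter of a state SD above its state's average.
mu_linked, sigma_linked = link_to_naep(
    mu_state_d=0.25, sigma_state_d=0.90,
    mu_naep_s=240.0, sigma_naep_s=30.0, rho_state=0.81,
)
print(round(mu_linked, 2), round(sigma_linked, 2))
```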
Note that because district means on the state scale, $\hat{\mu}^{state}_{dygb}$, are expressed 
in terms of standard deviation units of the state score distribution, they are attenuated due to 
measurement error. We disattenuate by dividing by the square root of the state test score 
reliability estimate. Although the district standard deviations, $\hat{\sigma}^{state}_{dygb}$, 
are also inflated due to measurement error in an absolute sense, because they are expressed as a 
percentage of the state standard deviation, which is itself inflated, they need not be adjusted 
for unreliability. Treating the main terms as independent random variables, we can derive the 
standard errors of the linked means and standard deviations. 
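
One way to obtain such standard errors, offered as a sketch rather than as the authors' own expressions, uses the exact variance of a product of two independent random variables, Var(XY) = Var(X)Var(Y) + Var(X)E[Y]^2 + Var(Y)E[X]^2, with estimates plugged in for the expectations. The sketch additionally assumes that the reliability is a fixed constant and that the state NAEP mean is independent of the other terms. Function names, argument names, and numeric inputs are hypothetical.

```python
import math


def var_product(x, se_x, y, se_y):
    """Variance of X*Y for independent X and Y, using plug-in estimates."""
    vx, vy = se_x ** 2, se_y ** 2
    return vx * vy + vx * y ** 2 + vy * x ** 2


def linked_standard_errors(mu_state_d, se_mu_state_d,
                           sigma_state_d, se_sigma_state_d,
                           se_mu_naep_s, sigma_naep_s, se_sigma_naep_s,
                           rho_state):
    """Standard errors of a linked district mean and SD under independence."""
    # Linked mean = NAEP state mean + (district mean * NAEP state SD) / sqrt(rho).
    var_mu = se_mu_naep_s ** 2 + var_product(
        mu_state_d, se_mu_state_d, sigma_naep_s, se_sigma_naep_s) / rho_state
    # Linked SD = district SD * NAEP state SD.
    var_sigma = var_product(
        sigma_state_d, se_sigma_state_d, sigma_naep_s, se_sigma_naep_s)
    return math.sqrt(var_mu), math.sqrt(var_sigma)


# Hypothetical inputs for one district, grade, subject, and year.
se_mu_linked, se_sigma_linked = linked_standard_errors(
    mu_state_d=0.25, se_mu_state_d=0.03,
    sigma_state_d=0.90, se_sigma_state_d=0.02,
    se_mu_naep_s=1.0, sigma_naep_s=30.0, se_sigma_naep_s=0.5,
    rho_state=0.81,
)
print(round(se_mu_linked, 3), round(se_sigma_linked, 3))
```

Treating the reliability as known keeps the algebra short; if its sampling error were also to be propagated, a delta-method term for the reliability would need to be added.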

Linkages, on their own, are expressions of wishful thinking, in this case that, had the district 
been sampled for NAEP, its resulting average score and standard deviation on the NAEP scale 
would be the linked estimates above. We conduct three additional analyses that assess the 
validity of the linked estimates for their intended research purposes. 

First, NAEP reports scores for 17 state districts (TUDAs) in 2009 and 20 in 2011 and 2013. For 
these particular large districts, we can compare the NAEP means and standard deviations to their 
linked means and standard deviations. For each district, we can obtain the discrepancy between 
the linked estimate, $\hat{\mu}^{naep}_{dygb}$, and the corresponding NAEP-reported estimate. 
We report the average of these discrepancies as the bias, and we report the square root of 
the average squared discrepancies as the Root Mean Squared Error (RMSE). We also report the 
correlation between the two, as well as a disattenuated correlation that accounts for the standard 
error of each observation, $\widehat{se}(\hat{\mu}^{naep}_{dygb})$. 
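
The recovery statistics described above can be sketched as follows. The sign convention (linked minus NAEP-reported) and the form of the precision adjustment are assumptions made for illustration: the adjustment shown is the classical disattenuation formula, with each set's reliability estimated as one minus the ratio of its average squared standard error to its observed variance. The function name and the five data points are hypothetical, not values from this study.

```python
import math


def recovery_stats(linked, reported, se_linked, se_reported):
    """Bias, RMSE, correlation, and a disattenuated correlation."""
    n = len(linked)
    diffs = [l - r for l, r in zip(linked, reported)]  # linked minus reported
    bias = sum(diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)

    mean_l, mean_r = sum(linked) / n, sum(reported) / n
    var_l = sum((l - mean_l) ** 2 for l in linked) / (n - 1)
    var_r = sum((r - mean_r) ** 2 for r in reported) / (n - 1)
    cov = sum((l - mean_l) * (r - mean_r)
              for l, r in zip(linked, reported)) / (n - 1)
    corr = cov / math.sqrt(var_l * var_r)

    # Share of observed variance that is not sampling error, for each set.
    rel_l = 1 - (sum(s * s for s in se_linked) / n) / var_l
    rel_r = 1 - (sum(s * s for s in se_reported) / n) / var_r
    corr_adj = corr / math.sqrt(rel_l * rel_r)
    return {"bias": bias, "rmse": rmse, "corr": corr, "corr_adj": corr_adj}


# Hypothetical linked and NAEP-reported means for five districts.
linked = [232.0, 221.5, 240.3, 215.0, 228.7]
reported = [229.0, 224.0, 236.0, 216.5, 224.0]
stats = recovery_stats(linked, reported, [1.5] * 5, [1.5] * 5)
print({name: round(value, 3) for name, value in stats.items()})
```

Because the adjustment divides by estimated reliabilities, it can exceed 1 in small samples when the observed correlation is already close to 1.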

Second, we have access to a large database of scores from a testing program adopted at the 
school and district level across the country. In a large number of districts across many states, the 
number of student test scores exceeds 90% of the district’s enrollment. We estimate means and 
standard deviations for these districts, and we report correlations and disattenuated correlations 
between these and the linked district estimates on the NAEP scale. This correlation is expected 
to be lower due to any divergence in the construct measured by this third-party test and NAEP. 

Third, Bandeira de Mello, Bohrnstedt, Blankenship, and Sherman (2015) report mappings of 
state proficiency standards on the NAEP scale using an equipercentile linkage, based on 
knowledge of the schools sampled by NAEP and their proficiency rates. We evaluate the 
alignment between the state EDFacts population and the NAEP population by linearly mapping 
the estimated cut score on the standardized state test score scale, $\hat{\gamma}^{state}_{sygb}$, 
to the NAEP scale via the same linking function above. 

$$\hat{\gamma}^{naep}_{sygb} = \hat{\mu}^{naep}_{sygb} + \frac{\hat{\gamma}^{state}_{sygb}}{\sqrt{\hat{\rho}^{state}_{sygb}}} \cdot \hat{\sigma}^{naep}_{sygb}$$


We report bias, RMSE, and correlations with the reported mapped standards in the full paper. 
Discrepancies between these mapped standards and those reported by Bandeira de Mello et al. 
(2015) are a partial indication of misalignment between the state EDFacts and NAEP 
populations.^ 


Findings / Results: 

Given space limitations, we limit our discussion to findings from the first and most direct 
recovery benchmark: means and standard deviations of NAEP TUDAs. Table 1 shows bias, 
RMSE, correlations, and precision-adjusted correlations for the 17-20 TUDAs by subject, grade, 
and year (EDFacts is due to release 2013 restricted-use data shortly). Figure 1 shows a 
scatterplot of linked and NAEP-reported means. Precision-adjusted correlations are high (0.89 
to 0.99 in Table 1). However, persistent positive bias across TUDAs (around one-tenth of a 
NAEP standard deviation) indicates systematically high performance of TUDA districts within 
their respective state test score distributions, leading to higher-than-expected NAEP mappings. 

Possible explanations for these discrepancies include disproportionately higher true performance 
on state rather than NAEP content relative to other districts, differences in motivation across 
tests by district, and relative inflation (for example, one district with a positive discrepancy had a 
known cheating scandal on the state test in one year; the discrepancy shrank by the next 
administration). Note that the expected bias from a linear mapping is near 0 by design. Thus, the 
reported RMSE underestimates the variation we would expect if we could perform this analysis 
on all districts, not just TUDAs. We describe results for other criteria in the full paper. 


Conclusions: 


A nationwide district-level dataset of means and standard deviations will be a valuable tool for 
future descriptive and causal analysis if and only if it is valid for its intended research purposes. 
We use a range of validation approaches, here and in the full paper, to demonstrate that overall 
recovery as indicated by correlations is strong, but bias, although small, is systematic for certain 
districts. Following the publication of this paper, we intend to release this dataset to the public, 
complete with documentation detailing its caveats. We hope to benefit from feedback at the 
SREE spring conference to ensure that this dataset, unprecedented in its geographical detail, 
advances valid future research analyses and conclusions. 


^ Incongruence between state and NAEP distributional forms may also explain discrepancies. Incongruence causes 
linear and equipercentile linkages to diverge but is not a threat to inferences from the linear linkage. 

Appendices 

Not included in page count. 

Appendix B, Tables and Figures 

Not included in page count. 

Table 1: Estimated correlations between NAEP-linked EDFacts estimates and NAEP TUDA estimates 

                       Linear Calibration       Precision-Adjustment           Linear Calibration-EPE   Precision-Adjustment-EPE
Subject  Grade  Year   Correlation  Bias  RMSE  SE_State  SE_NAEP  Adj. Corr.  Correlation  Bias  RMSE  SE_State  SE_NAEP  Adj. Corr.
Reading  4      2009   0.93         1.40  4.10  1.57      1.65     0.94        0.94         2.18  4.42  1.88      1.67     0.96
Reading  4      2011   0.94         0.80  4.48  1.31      1.54     0.96        0.94         1.61  3.78  1.37      1.59     0.97
Reading  8      2009   0.87         1.63  4.19  1.42      1.67     0.89        0.93         1.86  3.73  1.64      2.13     0.97
Reading  8      2011   0.95         1.26  3.25  1.13      1.30     0.97        0.95         1.46  2.98  1.17      1.27     0.97
Math     4      2009   0.95         3.77  5.16  2.46      1.23     0.95        0.96         4.06  5.90  3.40      1.26     0.97
Math     4      2011   0.95         2.56  4.87  1.05      1.02     0.95        0.97         3.04  4.74  1.10      1.01     0.98
Math     8      2009   0.92         4.28  5.98  1.33      1.37     0.92        0.96         4.74  6.15  1.40      1.44     0.97
Math     8      2011   0.93         3.01  5.18  1.13      1.25     0.94        0.98         3.25  3.96  1.19      1.27     0.99


Figure 1: NAEP-linked EDFacts and NAEP TUDA estimated means, grades 4 and 8, 2009 and 2011 

[Scatterplot titled "Mean Test Scores For NAEP TUDA Districts: Math & ELA, 2009 & 2011, Grades 4 & 8"; 
axis label: Mean Score, EDFacts, on NAEP Scale.] 